UCA 16.0 from Ken #571
Conversation
From Ken:

Simply copy unidata-15.1.0d5.txt (the 7/28/2023 revision used for the 15.1.0 release) to unidata-16.0.0d1.txt and update the internal date, version, etc. Run the existing 15.1.0 sifter executable with the library updated to 16.0.0 for properties, case mapping, etc., to process unidata-16.0.0d1.txt. Verify that the output allkeys.txt is identical to the allkeys.txt released for 15.1.0, except for the generated date header.

Current initial state archived as: unidata-16.0.0d1.txt (1550494 bytes, 10/06/2023)

Process the diff between the released 15.1.0 UnicodeData.txt (UnicodeData-15.1.0d3.txt) and the current latest draft of the 16.0.0 UnicodeData.txt (UnicodeData-16.0.0d7.txt). Clean this up to just a list of all 1177 new UnicodeData.txt records for 16.0. Run the results through a small transducer that snips out the fields not used for the unidata.txt input to the sifter. (This is just a simple utility I have had for years -- it would be easy to replicate in Perl or Python, as needed.)

The result is archived as: uc151to160add.txt (49212 bytes, 10/07/2023)

That is the source of the data fields to paste into the evolving draft of unidata.txt for 16.0. I do it this way because the bookkeeping is all automatic: I need to find the right place in unidata.txt for all 1177 lines, and when the input from uc151to160add.txt has dwindled down to 0 lines left to transfer, I know I'm completely done.
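The transducer itself is not shared in the thread, but a minimal Python sketch of such a field-snipper is easy to write. The field selection here (code point, name, general category, decomposition) is an assumption inferred from the reduced record shape quoted later in this thread, not Ken's actual utility:

```python
# Hypothetical re-creation of the field-snipping transducer described
# above. It reduces full 15-field UnicodeData.txt records to the
# code;name;gc;decomp;;;;; shape used by the sifter's unidata.txt input.
# The exact field selection is inferred from the records quoted in this
# thread, not taken from Ken's actual tool.

def snip_record(ucd_line: str) -> str:
    fields = ucd_line.rstrip("\n").split(";")
    code, name, gc = fields[0], fields[1], fields[2]
    decomp = fields[5] if len(fields) > 5 else ""
    return f"{code};{name};{gc};{decomp};;;;;"

def snip_file(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(snip_record(line) + "\n")
```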
From Ken:

For this delta, search the input for all the new lines that can be intercalated into unidata.txt without affecting any primary weights or introducing any new secondary weights.

1. Move the new uppercase for 0264 (A7CB) into unidata.txt right below the entry for 0264. This just introduces the new uppercase, and does not impact the primary weight sequence at all. Verify by generating allkeys.txt and examining the diff.

2. Look for any new combining marks for Brahmic scripts that should be equated with the existing weights for Devanagari candrabindu, anusvara, and visarga. (This is a regular feature of DUCET now, to avoid the unnecessary proliferation of secondary weights for these for each new script that has them, because they are never mixed and matched across scripts.) The clear candidate for 16.0 is Tulu-Tigalari. The 3 characters in question are:

   ```
   113CA;TULU-TIGALARI SIGN CANDRA ANUNASIKA;Mc;0901;;;;;
   113CC;TULU-TIGALARI SIGN ANUSVARA;Mc;0902;;;;;
   113CD;TULU-TIGALARI SIGN VISARGA;Mc;0903;;;;;
   ```

   Checking the latest proposal document (L2/23-031) verifies that "CANDRA ANUNASIKA" is the Tulu-Tigalari candrabindu analog. These three lines are copied into unidata.txt right below the corresponding entries for the Grantha analogs (the closest related script, as well as sequential in code point order). The artificial decompositions 0901, 0902, and 0903 are manually added to the entries. Verify by generating allkeys.txt and examining the diff.

3. Rinse and repeat for the next new Brahmic script with any of these three characters, Gurung Khema:

   ```
   1612D;GURUNG KHEMA SIGN ANUSVARA;Mn;0902;;;;;
   ```

   The order of intercalation for these entries is not critical, because they are just being equated to something else. But I put it after 11F03, KAWI SIGN VISARGA, to keep it in code point order in the input file. The artificial decomposition 0902 is manually added to the entry. Verify by generating allkeys.txt and examining the diff.

4. Note that the Kirat Rai ANUSVARA, TONPI (a bindu), and VISARGA should *not* be equated to the Devanagari combining marks. Kirat Rai was deliberately encoded more like an alphabet, even though it minimally qualified as an abugida because of the inherent vowel and the existence of a killer. It is best for Kirat Rai to just give these three characters a primary weight in code point order. So I skip over them for this draft of unidata.txt. They will be processed later when adding all the primary weights for Kirat Rai. So: a no-op at this point in the processing.

5. Note any obvious sets of compatibility decompositions that will just result in more equivalences without impacting primary weights. The obvious set for 16.0 is the outlined Latin capital letters in the legacy computer symbols repertoire, 1CCD6..1CCEF. Move these entries into unidata.txt just before the very similar set of squared Latin capital letters, 1F130..1F149. Again, the exact placement in unidata.txt doesn't matter, because these will all end up equated to existing other weights, but putting 1CCD6..1CCEF in that location makes the parallel collation treatment obvious and makes these easier to track in the input file. No manual modification of the decomposition is needed, as these all already have formal compatibility decompositions. Verify by generating allkeys.txt and examining the diff.

At this point, the obvious candidates (other than digits) have been taken care of. 31 down, 1146 to go. Note that to this point, not only is the diff for unidata.txt easy to examine, but also the diff for allkeys.txt is still well-formed and easy to interpret.

Archive this delta 2: unidata-16.0.0d2.txt (1552253 bytes, 10/07/2023)
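The "manually added" artificial decompositions above amount to editing one field of each record. A tiny sketch, with the field index assumed from the record shape quoted in this thread:

```python
# Sketch: inject an "artificial decomposition" (e.g. 0902 for a new
# script's anusvara) into the decomposition field of a sifter-input
# record, matching the manual edits described above. The record layout
# (decomposition at field index 3) is assumed from the examples quoted
# in this thread.

def with_artificial_decomp(record: str, decomp: str) -> str:
    fields = record.split(";")
    fields[3] = decomp   # decomposition field of the sifter input record
    return ";".join(fields)
```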
From Ken:

Delta 3 processing will focus on the digits, which, if intercalated correctly, will also not impact any primary weights. The work for these is fairly tedious for 16.0, because there are eight new sets of digits in the repertoire. This is also the point where it makes sense to come to grips with where to put all the new scripts in the overall order, so that the points where the digits are intercalated into the input file are reasonably consistent with each other and with the decision about the overall order of new scripts.

Looking at all the new scripts, I suggest the following:

- Tulu-Tigalari should go right after Grantha. This matches code point order (intentional in the Roadmap) and these two are historically close.
- Todhri should go right after Vithkuqi. This matches code point order (intentional in the Roadmap) and these two are both Albanian scripts.
- Sunuwar should go right after Tangsa, another con-script used in the same general area of India, NE Myanmar and the Himalayas.
- Gurung Khema should go right after Sunuwar. It is also a con-script used in Nepal and Sikkim.
- Kirat Rai should go right after Gurung Khema. It is also a con-script used in NE India and Sikkim. The ordering of Tangsa > Sunuwar > Gurung Khema > Kirat Rai will also match the order of the block descriptions in the 16.0 core specification.
- Ol Onal should go right after Ol Chiki, another con-script for a Munda language spoken in the same general area of NE India. (This also matches the order of the block descriptions in the 16.0 core specification.)
- Garay is an African con-script for Wolof in Senegal. Its order is rather arbitrary, but in the core specification we put it after Medefaidrin (from Nigeria). We can do the same for UCA, which will put it in unidata.txt after Medefaidrin and before Adlam.

1. Myanmar Extended-C digits

   The Myanmar Pa'O digits (116D0..116D9) and Myanmar Eastern Pwo Karen digits (116DA..116E3) aren't from a new script. They can be slotted in right after the Myanmar Tai Laing digit series, and they might as well be handled together as a set of two. For historic reasons, and in part because of the way the sifter works, digits have been accumulated together in unidata.txt by digit value, so for each new set, the digit zero needs to be intercalated in the set of other scripts' zeros, etc. Verify by generating allkeys.txt and examining the diff.

   One of the reasons why I do this deliberately, and mostly one set of digits at a time, is that I am also looking for any anomalies that might have crept into my update of the underlying library. Unlike ICU, which I presume at this point just parses the entire UCD and autogenerates its numeric tables, I still do some of this work of table updates by hand, so occasionally I muff an update and need to make sure I'm getting the correct 0..9 values as expected for each new set of digits added in a version.

2. Sunuwar digits

   Sunuwar digits (11BF0..11BF9) get intercalated after Tangsa digits, per the above scheme. Verify by generating allkeys.txt and examining the diff.

3. Gurung Khema digits

   Gurung Khema digits (16130..16139) get intercalated after Sunuwar digits. Verify by generating allkeys.txt and examining the diff.

4. Kirat Rai digits

   Kirat Rai digits (16D70..16D79) get intercalated after Gurung Khema digits. Verify by generating allkeys.txt and examining the diff.

5. Ol Onal digits

   Ol Onal digits (1E5F1..1E5FA) get intercalated after Ol Chiki digits. Verify by generating allkeys.txt and examining the diff.

6. Garay digits

   Garay digits (10D40..10D49) get intercalated after Medefaidrin digits. Verify by generating allkeys.txt and examining the diff.

7. Outlined digits

   The outlined digits (1CCF0..1CCF9) from the legacy computer symbols repertoire should be treated analogously to the segmented digits. These all have explicit <font> compatibility decompositions, which will impact their tertiary weights. The entire range can be slotted into unidata.txt as a chunk, right after the list of segmented digits. Verify by generating allkeys.txt and examining the diff.

At this point I'm done with the digits. I do one more comprehensive diff between the d2 and d3 versions of allkeys.txt, to verify all still looks to be in correct order. 80 more down, 1066 to go.

Archive this delta 3: unidata-16.0.0d3.txt (155520 bytes, 10/07/2023)
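The accumulate-by-digit-value scheme above (every script's zero next to the other zeros, and so on) is easy to sanity-check mechanically. A sketch, assuming standard UnicodeData.txt field positions (gc at field 2, decimal value at field 6):

```python
# Sketch: group candidate UnicodeData.txt digit records by their
# decimal value, mirroring the by-value accumulation in unidata.txt.
# Each new set of digits should contribute exactly one code point per
# bucket 0..9; anything else signals the kind of hand-update anomaly
# Ken mentions checking for.
from collections import defaultdict

def digits_by_value(ucd_lines):
    buckets = defaultdict(list)
    for line in ucd_lines:
        fields = line.split(";")
        if fields[2] == "Nd":                 # decimal digits only
            buckets[int(fields[6])].append(fields[0])
    return dict(buckets)
```

Running this over the 80 new 16.0 digit records should produce exactly ten buckets of eight code points each.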
From Ken:

After scanning the remaining new characters for possible candidates for intercalation without impacting primary or secondary weights, I spied the following candidates.

1. Nuktas

   There was a stray nukta added, 11F5A KAWI SIGN NUKTA. Most of the nuktas are also folded to a single secondary weight, so this one won't require a new secondary weight:

   ```
   11F5A;KAWI SIGN NUKTA;Mn;093C;;;;;
   ```

   Note that an explicit decomposition to 093C is added here. Verify by generating allkeys.txt and examining the diff.

2. Vedic tone marks

   The pattern in DUCET for all Vedic tone marks is just to ignore them completely for collation. There are a couple of Vedic tone marks added in 16.0 for Tulu-Tigalari. These can be added to the ranges of ignored Vedic accents in unidata.txt. Verify by generating allkeys.txt and examining the diff.

3. Garay combining marks

   Garay has four diacritic combining marks, 10D6A..10D6D, which do not participate in any canonical equivalences. (The combining 10D69 GARAY VOWEL SIGN E is highly significant to the orthography, so it must get a primary weight.)

   ```
   10D6A;GARAY CONSONANT GEMINATION MARK;Mn;;;;;;
   10D6B;GARAY COMBINING DOT ABOVE;Mn;;;;;;
   10D6C;GARAY COMBINING DOUBLE DOT ABOVE;Mn;;;;;;
   10D6D;GARAY CONSONANT NASALIZATION MARK;Mn;;;;;;
   ```

   All 4 of these are non-spacing marks above. The proposal is unclear about the collation implications of the gemination mark, but it seems safe to presume it should be given a secondary weight and not be ignored. The dot above and double dot above are effectively two nuktas. The dot above is used on one letter to mark a native Garay ŋ, as opposed to a prenasalization. The dot above and double dot above are used as diacritics on another letter to indicate two borrowed Arabic sounds. That means the dot above and double dot above need to be distinguished from each other. For these diacritic-marked letters there are no atomic encodings -- they can only be represented as sequences.

   The collation information in the proposal indicates that the sequences (e.g., the two sequences representing Arabic sounds: <10D76, 10D6B>, <10D76, 10D6C>) should get primary weights. DUCET does this kind of thing for atomic characters which have canonical decompositions, but it does *not* do it for arbitrary sequences, since there is no "target" in the encoding to assign the primary weight for the sequence to. Conceivably, the sifter apparatus could be extended to allow for weighting these ghost primaries, but not now for the 16.0 draft.

   The nasalization mark (10D6D) isn't clearly exemplified, and the proposal claims it is ignored for collation. The gemination mark can occur over the GARAY VOWEL SIGN E and also over the COMBINING DOT ABOVE. At a minimum, its secondary weight should be distinguished from the COMBINING DOT ABOVE. There are other complications for Garay collation which mean it cannot be fully handled by a default ordering in DUCET anyway. The net conclusions I draw for now are that:

   a. 10D69 GARAY VOWEL SIGN E needs a primary weight.
   b. 10D6A GARAY CONSONANT GEMINATION MARK should get a script-specific secondary weight.
   c. 10D6B GARAY COMBINING DOT ABOVE can be weighted as generic above [0033].
   d. 10D6C GARAY COMBINING DOUBLE DOT ABOVE can be weighted as generic nukta [00C2].
   e. 10D6D GARAY CONSONANT NASALIZATION MARK can be weighted as generic above [0033].

   a) and b) will be dealt with later when doing all the primary weighting for Garay. c), d) and e) won't impact primary or secondary weights already in DUCET, so I am dealing with them now, and bleeding them out of the set of the Garay characters to be weighted later. The relevant hacks for them are:

   ```
   10D6B;GARAY COMBINING DOT ABOVE;Mn;F8F5;;;;;
   10D6C;GARAY COMBINING DOUBLE DOT ABOVE;Mn;093C;;;;;
   10D6D;GARAY CONSONANT NASALIZATION MARK;Mn;F8F5;;;;;
   ```

   to equate them to a generic above secondary weight (F8F5) and to the generic nukta (093C). These entries are added to unidata.txt in the section dealing with secondary weighting, just ahead of the section discussing Adlam secondaries, with notes explaining the generic weight decisions. Verify by generating allkeys.txt and examining the diff.

This concludes all the apparent candidates that don't require new primary or secondary weights -- although there are possibly a couple of other gc=Mn that might qualify. I'm deferring those to further analysis on a per-script basis. 6 more down, 1060 to go.

Archive this delta 4: unidata-16.0.0d4.txt (1556145 bytes, 10/07/2023)
From Ken:

I'm now at the point where further updates to unidata.txt will render the output file (allkeys.txt) effectively undiffable. I've never felt it was worth the effort to try to write custom tooling that would keep track of all the relative differences in weights and report changes on those differences. I suppose that is an opportunity for somebody who feels ambitious.

1. New Latin case pair

   A7CC/A7CD is a new Latin case pair for s with diagonal stroke. The proposal is silent about collation for these, so I'm just intercalating them with a primary weight difference after A7A9/A7AA, s with short stroke overlay. Generate allkeys.txt and verify that A7CC/A7CD show up properly weighted as a case pair following A7A9/A7AA in primary order.

2. New Cyrillic case pair

   1C89/1C8A is a new Cyrillic case pair for tje. The proposal is silent about collation for these, but states that this is a Khanty letter for [t'], i.e. [tʲ]. I'm intercalating it between Cyrillic letter twe and Cyrillic letter Komi tje, with a primary weight distinction. Generate allkeys.txt and verify that 1C89/1C8A show up properly weighted as a case pair following A68C/A68D in primary order.

3. Archaic letter SHRI for Kannada and Telugu

   ```
   0C5C;TELUGU ARCHAIC SHRII;Lo;;;;;;
   0CDC;KANNADA ARCHAIC SHRII;Lo;;;;;;
   ```

   The proposal claims that these can be collated as equivalent to the word SHRII, i.e., the spelled-out sequences. I'm adding the relevant sequences, i.e. SHA-virama-RA-II, as a <sort> decomposition for each of these two:

   ```
   0C5C;TELUGU ARCHAIC SHRII;Lo;<sort> 0C36 0C4D 0C30 0C40;;;;;
   0CDC;KANNADA ARCHAIC SHRII;Lo;<sort> 0CB6 0CCD 0CB0 0CC0;;;;;
   ```

   Generate allkeys.txt and verify that these two show up weighted correctly by the decomposition.

4. Arabic Pegon letters

   The proposal is silent about collation for these 4 additions.

   ```
   0897;ARABIC PEPET;Mn;;;;;;
   ```

   Because the pepet character was introduced to have a proper form for what is currently commonly represented with a maddah, I give it a <sort> equivalence to 0653 and intercalate it in unidata.txt right after maddah. This is likely to provide better behavior for Pegon material that may mix maddah and (in the future) 0897 pepet for this. The revised entry in unidata.txt looks like this:

   ```
   0897;ARABIC PEPET;Mn;<sort> 0653;;;;;
   ```

   ```
   10EC2;ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;
   10EC3;ARABIC LETTER TAH WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;
   10EC4;ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW;Lo;;;;;;
   ```

   Default collation for these nuktated Arabic letters is fairly arbitrary. I intercalated 10EC2 just ahead of 0759, 10EC3 just after 088C, and 10EC4 just after 08B4. Generate allkeys.txt and verify that the pepet shows up weighted as equivalent to maddah and that the other 3 have primary weights in the expected order.

5. Arabic combining alef overlay

   The proposal is not explicit about collation for 10EFC, but implies that it should be treated as equivalent to 0670 ARABIC LETTER SUPERSCRIPT ALEF. I am simply equating it to 0670 in DUCET, similar to the way several variant forms of maddah are equated to 0653. Generate allkeys.txt and verify that 10EFC is weighted identically to 0670.

11 more down, 1049 to go.

Archive this delta 5: unidata-16.0.0d5.txt (1557279 bytes, 10/07/2023)
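Spot checks like "verify that 10EFC is weighted identically to 0670" can be scripted against the generated allkeys.txt rather than eyeballed. A sketch, assuming the standard allkeys line shape (`code points ; collation elements # comment`); the weights in the demo data below are illustrative placeholders, not the real DUCET values:

```python
# Sketch: pull the collation elements assigned to a single code point
# out of allkeys.txt lines, so equivalences (e.g. 10EFC == 0670) can
# be asserted mechanically.

def collation_elements(allkeys_lines, codepoint):
    for line in allkeys_lines:
        body = line.split("#", 1)[0]          # strip trailing comment
        if ";" not in body:
            continue                          # @version lines, blanks
        chars, elements = body.split(";", 1)
        if chars.split() == [codepoint]:
            return elements.strip()
    return None
```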
From Ken:

O.k., a new day. Time to start in on the meat of the matter -- the whole-script additions in primary order. I start with the unicameral script additions. Those are a bit simpler than any new bicameral scripts (Garay for 16.0). (See above in the Delta 3 discussion for the detailed rationale for the placement of the new scripts -- I won't replicate that discussion here, but just proceed, based on the placement decisions already made.)

1. Kirat Rai

   I start with this one, even though it has some complexities, because it is fresh in our minds from the extended discussion of the unicodetools PR. First move the entire range of Kirat Rai alphabetic records (in code point order) into unidata-16.0.0d6.txt, starting after Tangsa. This includes all the consonants, all the vowel signs, which in Kirat Rai are actually full standalone letters, and the two killers:

   ```
   16D40;KIRAT RAI SIGN ANUSVARA;Lm;;;;;;
   ...
   16D6C;KIRAT RAI SIGN SAAT;Lm;;;;;;
   ```

   I omit the three punctuation marks for now. Those end up elsewhere in DUCET, and it is more efficient to deal with all the punctuation additions in a separate delta later on, after the alphabetic runs have been established.

   As noted in discussion, the ANUSVARA, TONPI (bindu), and VISARGA for Kirat Rai are encoded as standalone modifier letters, rather than as combining marks, and we've already decided not to try equating them to the various combining candrabindu, visarga, etc., that use generic weights shared with the Devanagari archetypes of the combining marks. The proposal suggests that they be given primary distinctions and be left in code point order. It might be better to move them to the end of the list of consonants, but without further rationale provided, the simplest solution here is to simply do as the proposal suggests. Indeed, the proposal states that "This sort order with anusvara, tonpi, and visarga sorting first has been approved by AKRS." So we just leave it that way for DUCET.

   The next complication results from canonical equivalences for three Kirat Rai vowels: AI, O, AU. In cases like this, the way to get the sifter to introduce contractions is to surround the records in question with the CONTRACTION pragma, to wit:

   ```
   CONTRACTION
   16D68;KIRAT RAI VOWEL SIGN AI;Lo;16D67 16D67;;;;;
   16D69;KIRAT RAI VOWEL SIGN O;Lo;16D63 16D67;;;;;
   16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67;;;;;
   DEFAULT
   ```

   In cases like this, as for Tamil, Kannada, etc., I also drop a comment into unidata.txt with a somewhat redundant explanation, to remind everybody what is going on here. The effect of the CONTRACTION pragma is to tell the sifter that for the range of entries where it is in effect, the sifter is to go ahead and assign a primary weight to the code point and *also* generate a contraction entry from the decomposition, giving it the same weight as the atomic character code point. In the absence of the CONTRACTION pragma, such an entry is instead just entered into allkeys.txt with the sequence of weights from the decomposition, and does not have its own primary weight.

   But wait, there's more. Because of the strange encoding of the Kirat Rai vowel signs, we have a canonical closure problem for 16D6A. The full decomposition for 16D6A is <16D63, 16D67, 16D67>, and we need to weight that sequence with the same primary weight via contraction. Fortunately, this is not the first time this problem has been encountered for the sifter. A similar problem of canonical closure for a recursive canonical decomposition occurs for 0CCB in Kannada and 0DDD in Sinhala. The mechanism baked into the sifter to deal with this is a "secondary decomposition", which can be added to the input entry in the decomposition field. As a first step for handling 16D6A, I'm putting the following entry into unidata-16.0.0d6.txt:

   ```
   16D6A;KIRAT RAI VOWEL SIGN AU;Lo;16D69 16D67, 16D63 16D67 16D67;;;;;
   ```

   The comma delimitation in the decomposition field allows for adding a secondary decomposition. When enclosed within the CONTRACTION pragma, this generates a second contraction using the secondary decomposition information.

   This *almost* solves the problem for Kirat Rai, as it was solved for Kannada and Sinhala. Unfortunately, the way the Kirat Rai vowels work, there is yet *another* sequence that is canonically equivalent: <16D63, 16D68>. That sequence is equivalent to <16D63, 16D67, 16D67>, so it also needs a contraction that is weighted with the same primary weight. The sifter code currently treats the Kannada and Sinhala cases as essentially the extent of the problem -- only a *single* secondary decomposition is allowed for in the code. The syntax for this is currently:

   ```
   decompfield := decomp (, decomp)?
   ```

   rather than:

   ```
   decompfield := decomp (, decomp)*
   ```

   because the code was written to just look for the comma and then process a *single* secondary decomposition value, rather than being written to expect and process an indefinite *list* of secondary decompositions. To fix this for Kirat Rai (and to future-proof against any similar cases), I'm going to have to do some considerable refactoring of the relevant decomposition-handling code in the sifter, which is a tricky and sensitive part of the code. For now I am just going to postpone that code work until all the rest of the 16.0 input for UCA has been taken care of.

   In the meantime, while adding the first relevant secondary decomposition for 16D6A to unidata.txt, I also reversed the order of the secondary decomposition for the 0CCB and 0DDD entries in unidata.txt. Now the field with the secondary decomposition better matches what the code states, assuming the *first* entry is the formal canonical decomposition string from the UCD (which is then recursively decomposed internal to the sifter processing), followed by a secondary decomposition, which in the Kannada and Sinhala cases is the full decomposition. The impact on the output in allkeys.txt is just to invert the order of two contraction lines; it does not affect any of the weighting per se. The other side effect is that the output log will stop warning about encountering a "Non-binary" canonical decomposition for 0CCB and 0DDD in the recursive decomposition.

   Generate allkeys.txt and verify that the Kirat Rai weights are as expected, with special attention to the results for 16D68, 16D69, and 16D6A. Also examine the impact of the secondary decomposition change for 0CCB and 0DDD.

O.k., this discussion for Kirat Rai and its implications is hairy enough that I'm going to make this its own delta, without introducing more scripts into this set of changes. 45 more down, 1004 to go.

Archive this delta 6: unidata-16.0.0d6.txt (1559396 bytes, 10/08/2023)
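The CONTRACTION pragma behavior and the generalized `decomp (, decomp)*` grammar described above can be modeled in a few lines. This is a toy illustration of the described behavior, not the sifter's actual code; the weights are just sequential integers:

```python
# Toy model of the CONTRACTION pragma: inside a CONTRACTION...DEFAULT
# region, a record gets its own primary weight AND each comma-separated
# decomposition alternative is emitted as a contraction carrying that
# same weight. Real sifter weighting is far richer; this only shows
# the weight-sharing mechanics described in the thread.

def parse_decomp_field(field):
    """decompfield := decomp (, decomp)* -- the generalized grammar."""
    return [alt.split() for alt in field.split(",") if alt.strip()]

def toy_sift(lines, next_weight=1):
    contraction_mode = False
    out = []                                  # (key sequence, weight)
    for line in lines:
        if line == "CONTRACTION":
            contraction_mode = True
        elif line == "DEFAULT":
            contraction_mode = False
        else:
            fields = line.split(";")
            code, decomps = fields[0], parse_decomp_field(fields[3])
            if contraction_mode or not decomps:
                out.append(([code], next_weight))
                for alt in decomps:           # contractions share weight
                    out.append((alt, next_weight))
                next_weight += 1
            # (outside CONTRACTION, a decomposable entry would instead
            # be weighted by its decomposition; omitted in this toy)
    return out
```

With the 16D6A entry quoted above, all three canonically equivalent spellings come out with the same primary weight, which is exactly the closure the secondary decomposition exists to provide.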
From Ken:

1. Todhri

   Given all the complications of Kirat Rai to start off the day, I'm rewarding myself before lunch by dealing with an easy case: Todhri. This is just a straight unicameral alphabet, with no complications other than two letters that have canonically equivalent sequences. Move the relevant entries for Todhri (105C0..105F3), in code point order, into unidata.txt, right after Vithkuqi. Apply the CONTRACTION pragma to the two decomposed vowels, 105C9 and 105E4. Generate allkeys.txt and verify that the Todhri weights are as expected, including the two contractions.

2. Sunuwar

   This is another simple one: a simple unicameral alphabet with no marks, and with the desired collation order the same as the code point order. Move the relevant entries for Sunuwar (11BC0..11BE0), in code point order, into unidata.txt, right after Tangsa (and ahead of the Kirat Rai I just added). Leave the one punctuation sign to deal with later. Generate allkeys.txt and verify that the Sunuwar weights are as expected.

3. Gurung Khema

   Gurung Khema is a bit more complicated. This one is an abugida, and it has decomposition and contraction issues for the vowel signs. First move all the relevant entries for Gurung Khema (16100..1612F), in code point order, into unidata.txt, right after Sunuwar. The 8 multi-part vowels with decompositions, 16121..16128, need the CONTRACTION pragma, as the intent is for the vowels to all get primary weights. 3 of the multi-part vowels, 16126..16128, have full decompositions into sequences of three parts. Because of this, as for Kirat Rai discussed above, those three need to have the full decompositions added in their entries as secondary decompositions. The entries affected are:

   ```
   16126;GURUNG KHEMA VOWEL SIGN O;Mn;16121 1611F, 1611E 1611E 1611F;;;;;
   16127;GURUNG KHEMA VOWEL SIGN OO;Mn;16122 1611F, 1611E 16129 1611F;;;;;
   16128;GURUNG KHEMA VOWEL SIGN AU;Mn;16121 16120, 1611E 1611E 16120;;;;;
   ```

   A replication note for when trying to build allkeys.txt with the sifter in the unicodetools: Before the sifter will work correctly for weighting of abugidas, the Alphabetic property has to be updated for the repertoire in question. In particular, all gc=Mn or gc=Mc vowel signs, consonant signs, and length marks in abugidas need to be set explicitly to Other_Alphabetic in PropList.txt first (and the relevant derivations run based on that). Otherwise, during the sift process, the sifter won't see these as alphabetic and branch down the path for primary weights, but rather will identify them as otherwise unaccounted-for combining marks, and attempt to give them secondary weights. Anusvaras and visargas should also be set to Other_Alphabetic, but those are already bled off in unidata.txt by being given explicit decompositions to generic marks.

   Another piece of the puzzle is that nuktas and viramas (including killers) should be given the Diacritic property in PropList.txt, but these are more marginal for sifter behavior. Most nuktas are now bled off with explicit decompositions, and the viramas are almost all picked up in the sifter via their ccc=9 values. This could become a problem in the future if SAH insists on ccc=0 for some newly encoded viramas, at which point the sifter code may need an update to catch any combining mark viramas (or conjoiners and killers) with ccc=0. The example we have for 16.0, in Kirat Rai, is not a problem, because that is gc=Lm, ccc=0, so the sifter gets its Alphabetic status from gc=Lm and assigns it a primary weight.

   Generate allkeys.txt and verify that the Gurung Khema weights are as expected, with special attention to the vowel contractions.

4. Tulu-Tigalari

   First move all the relevant entries for Tulu-Tigalari (11380..113D0), in code point order, into unidata.txt, right after Grantha. Put in the CONTRACTION pragma for the 3 two-part dependent vowel signs, 113C5, 113C7, 113C8. Do the same for each of the 4 two-part independent vowel signs, 11383, 11385, 1138E, 11391. Those aren't contiguous in code point order, so multiple instances of the pragma should be used, to make sure they don't pick up entries that they shouldn't.

   Now, checking against the detailed specification of the collation order in the proposal (L2/22-031), invert the order of 113B3 LLA and 113B4 RRA, so the collation order is RRA < LLA. That seems to be a deliberate choice in the proposal. Next move 113D1 TULU-TIGALARI REPHA into unidata.txt, immediately after the RA (113AC). The repha is a separately encoded form of ra.

   Note that for the Tulu-Tigalari vowels, there are deliberate encoding gaps for short e and short o. Those might be added to the encoding later on, in which case they would intercalate neatly in the gaps, and would fit in the same places in the primary collation order. There is one anomaly in the specification of collation, in that it specifies the primary order for vowel sign o, even though that is not encoded. There is also a typo indicating: vowel sign vocalic ll << vowel sign ee. That should be a primary distinction, like all the rest. Ignored.

   The au length mark (113C8) only occurs as the second part of some two-part vowels, and would basically not be weighted alone in most text, because it is bled by the contractions that form the weights for the atomically encoded two-part vowels. It makes more sense to give it a primary order *after* the viramas, so I have reversed its position in unidata.txt, as compared to the specification in the proposal. See the treatment for Grantha, which has similar components.

   The pluta (113D3) is not in L2/22-031. It was added later, based on Srinidhi and Sridatta's L2/22-260. L2/22-260 is silent about its ordering. It is a letter that serves as a different kind of vowel lengthener. I'm giving it a primary order after the au length mark. Again, see the comparable treatment of the same component in Grantha.

   The gemination mark (113D2) is also not in L2/22-031, but comes from L2/22-260, which is silent about its ordering. However, the comparison is made there to Gurmukhi addak (0A71), Khojki sign shadda (11237), and Soyombo gemination mark (11A98). For DUCET, 0A71 is given a Gurmukhi-specific secondary weight. The Khojki shadda is simply equated to the Arabic shadda. The Soyombo gemination mark is given a Soyombo-specific secondary weight. On balance, it seems best to just add a new secondary weight for the Tulu-Tigalari gemination mark. I defer that to later, along with any other new secondary weight additions required. Remember that Garay is also introducing a gc=Mn gemination mark, so I have to figure out how to deal with that one, too. So: later.

   Regenerate allkeys.txt, and verify that the Tulu-Tigalari weights are as expected, with special attention to the various vowel contractions and to the few other characters that receive primary weights not in code point order, as noted above.

5. Ol Onal

   Ol Onal is another easy one. It is a simple alphabet. The proposal (L2/22-151R) specifies that the collation order is simply the same as the encoding order; however, the discussion of the use of the two combining marks, MU (nasalization, a dot above) and IKIR (lengthening, a dot below), suggests to me that it makes more sense to give them secondary weights. That is, rather than what is specified in the proposal:

   ```
   A < A+MU < A+IKIR < A+IKIR+MU = A+MU+IKIR
   ```

   what probably makes more sense for ordering is:

   ```
   A << A+MU << A+IKIR << A+IKIR+MU = A+MU+IKIR
   ```

   which would be accomplished better with secondary weights for MU and IKIR. The proposal compares these two marks to the corresponding marks in Ol Chiki, which are spacing and given gc=Lm, and the corresponding marks in Nag Mundari, which are non-spacing diacritics. For best consistency, I think we should follow the pattern of Nag Mundari, which gives the two non-spacing marks secondary weights, rather than the Ol Chiki pattern, where the spacing modifier letters get primary weights. In any case, since the preferred solution involves assigning new secondary weights, I defer the MU and IKIR to a later draft when I deal with those.

   So for now, for the rest of the alphabet, I move all the letters (1E5D0..1E5ED) and the HODDOND (1E5F0), in code point order, into unidata.txt right after Ol Chiki. Regenerate allkeys.txt, and verify that the Ol Onal weights are as expected.

I'll save Garay for the next delta. 233 more down, 771 to go.

Archive this delta 7: unidata-16.0.0d7.txt (1569698 bytes, 10/08/2023)
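The `<` (primary difference) versus `<<` (secondary difference) notation used above can be illustrated with a toy multilevel comparison. The weights here are invented purely for illustration:

```python
# Toy multilevel comparison illustrating "<" (primary difference) vs
# "<<" (secondary difference). Each collation element is a (primary,
# secondary) pair; zero weights are ignorable at that level, as in UCA.

def multilevel_cmp(a, b):
    for level in (0, 1):                      # primary, then secondary
        wa = [ce[level] for ce in a if ce[level]]
        wb = [ce[level] for ce in b if ce[level]]
        if wa != wb:
            return -1 if wa < wb else 1
    return 0

# Invented weights: a letter A with a primary weight, and two marks
# (like MU and IKIR above) that carry only secondary weights.
A, MU, IKIR = (10, 2), (0, 5), (0, 6)
```

With these weights, [A] and [A, MU] are identical at the primary level and differ only at the secondary level, i.e. A << A+MU, which is the ordering argued for above.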
From Ken:

1. Garay

I've saved Garay for a separate delta, because it is the only new bicameral script in the bunch, and that introduces intercalation complications for it in unidata.txt. It also has a sukun, a gemination mark, and a reduplication mark. The proposal implies it needs a syllabic ordering, rather than a simpler ordering. It also has a couple of variant letters, which need special handling.

Garay goes into unidata.txt after Medefaidrin and before Adlam, both of which are also West African bicameral scripts.

The first step is to move all the capital and small letters into unidata.txt. Then I rearrange them into case pairs, in the same manner as for Medefaidrin and Adlam. Note that this rearrangement is not strictly necessary to get weight assignments correct for the case pairs, but it is better to keep maintaining new bicameral scripts in the same way as for existing ones already in unidata.txt. This consistency helps in understanding what is going on.

Next, deal with OLD KA (10D64/10D84) and OLD NA (10D65/10D85). These are claimed to be just variant forms of KA and NA, respectively, and are claimed explicitly to sort equal to them. I merge them into the case pair bundles for KA and NA, with a <sort> decomposition to make them sort similarly to KA and NA, but with a case-distinguished tertiary weight difference.

It is completely unclear how to weight the vowels. There is an extensive discussion of collation in L2/22-048, but the main upshot of that seems to be that collation is conceived of in terms of a syllabic grid, rather than in terms of the actual string used to represent each node of the syllabic grid. I default to putting the vowels in code point order ahead of all the consonants. Any attempt to implement the actual syllabic ordering would require an extensive tailoring. Just putting the vowels first, in roughly the order specified for vowels in syllables, would seem to suffice for the default ordering in DUCET.
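The intended result for the variant letters can be illustrated with a small sketch. All weight values here are invented; the 0x02/0x08 lowercase/uppercase tertiaries follow general DUCET practice. The point is that OLD KA shares KA's primary and secondary weights via its <sort> decomposition, so only the tertiary (case) level distinguishes entries within the bundle:

```python
# Illustrative collation elements for the Garay KA case-pair bundle.
# Weight values are hypothetical, not the generated DUCET weights.
P_KA = 0x2A00            # hypothetical primary shared by KA and OLD KA
COMMON_SECONDARY = 0x20  # common secondary for letters

elements = {
    "GARAY CAPITAL LETTER KA":     (P_KA, COMMON_SECONDARY, 0x08),
    "GARAY SMALL LETTER KA":       (P_KA, COMMON_SECONDARY, 0x02),
    "GARAY CAPITAL LETTER OLD KA": (P_KA, COMMON_SECONDARY, 0x08),
    "GARAY SMALL LETTER OLD KA":   (P_KA, COMMON_SECONDARY, 0x02),
}

def equal_at(a, b, level):
    """Compare two collation elements through the given strength level."""
    return elements[a][:level] == elements[b][:level]

# The variants sort equal to KA at all three levels...
assert equal_at("GARAY SMALL LETTER KA", "GARAY SMALL LETTER OLD KA", 3)
# ...while the case pair is equal through the secondary but not the tertiary.
assert equal_at("GARAY SMALL LETTER KA", "GARAY CAPITAL LETTER KA", 2)
assert not equal_at("GARAY SMALL LETTER KA", "GARAY CAPITAL LETTER KA", 3)
```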
The gemination mark (10D6A) will get a script-specific secondary, so I defer that for now.

Regenerate allkeys.txt, and verify that the Garay weights are as expected, including all the case pairs and the equivalents for the two pairs of variant letters.

51 more down, 720 to go.

Archive this delta 8: unidata-16.0.0d8.txt (1572176 bytes, 10/08/2023)
From Ken:

All of the new scripts have been dealt with (at least for all the basic letters and their primary orders). Next I move on to clear away the enormous pile of miscellaneous symbols (gc=So) for 16.0, most of which are associated with the new block of computer legacy graphic symbols. These new symbols are just added to unidata.txt in the appropriate symbols sections in code point order. All will get primary weights assigned, but will be marked as variables for collation.

1. Symbols added in the 2400 block

Three new symbols (2427..2429) were added to the 2400 block. Add them to unidata.txt after 2426, in code point order. Regenerate allkeys.txt, and verify that these three symbols weight as expected.

2. Symbols added in the new 1CC00 block for legacy computing symbols

Several hundred symbols in the range 1CC00..1CEB3 were added in this block. The outlined digits and letters were already dealt with above. Add the rest of these entries into unidata.txt in code point order, ahead of the run of entries for the Symbols for Legacy Computing block at 1FB00. Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

3. Various arrows added to the Supplemental Arrows-C 1F800 block

Ten new arrows (1F8B2..1F8BB) were added to the 1F800 block. Add them to unidata.txt after 1F8B1, in code point order. Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

4. More legacy computing symbols

37 more legacy computing symbols (1FBCB..1FBEF) were added to the 1FB00 block. Add them to unidata.txt, after 1FBCA, in code point order. Regenerate allkeys.txt, and verify that this range of symbols weights as expected.

700 more down, 20 to go. Woot! Home stretch.

Archive this delta 9: unidata-16.0.0d9.txt (1602275 bytes, 10/08/2023)
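The intercalation step itself is mechanical: new entries slot into an existing ordered run at the stated anchor. A minimal sketch, treating entries as bare code points (the surrounding run is illustrative, not the actual unidata.txt contents):

```python
# Insert new symbol code points into an ordered run of existing entries,
# keeping code point order. Entries stand in for unidata.txt lines here;
# only the code point matters for placement.
import bisect

existing = [0x2423, 0x2424, 0x2425, 0x2426, 0x2440, 0x2441]  # illustrative run
for cp in (0x2427, 0x2428, 0x2429):   # the three symbols new in 16.0
    bisect.insort(existing, cp)

# The new symbols land right after 2426, ahead of the rest of the run.
assert existing[3:7] == [0x2426, 0x2427, 0x2428, 0x2429]
```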
From Ken:

For this delta, focus on all the remaining script-specific punctuation and symbols.

1. Dandas and double dandas

Tulu-Tigalari and Kirat Rai both have script-specific dandas. Intercalate them into unidata.txt roughly in code point order. This puts Kirat Rai after Mro in the list of dandas, and it puts Tulu-Tigalari after Khojki in the list of dandas. The Balinese inverted carik siki and inverted carik pareren are also a variety of danda and double danda. Those can be interlaced with the non-inverted danda and double danda for Balinese. Regenerate allkeys.txt, and verify that these six dandas weight as expected.

2. Miscellaneous other punctuation

1B7F BALINESE PANTI BAWAK is a variant of 1B5A BALINESE PANTI. Just intercalate after 1B5A.

10D6E GARAY HYPHEN is another script-specific hyphen. Add to the list of script-specific hyphens in the dashes and hyphens subsection of punctuation.

113D7..113D8, the two Tulu-Tigalari pushpikas, can just be intercalated in the miscellaneous punctuation section of unidata.txt, after the Khojki abbreviation sign and before the Newa punctuation.

16D6D KIRAT RAI SIGN YUPI is another Indic abbreviation sign. Add in the script-specific miscellaneous punctuation section of unidata.txt, in code point order.

11BE1 SUNUWAR SIGN PVO is an auspicious mark, similar in some ways to a siddham mark or a Devanagari bhale. These may have particular pronunciations, but are treated as punctuation marks. Just add to unidata.txt in the script-specific section of miscellaneous punctuation in code point order. That will put it right after 119E2 NANDINAGARI SIGN SIDDHAM.

1E5FF OL ONAL ABBREVIATION SIGN. Likewise just add in the script-specific miscellaneous punctuation section in code point order.

Regenerate allkeys.txt, and verify that these seven punctuation marks weight as expected and are indicated as variables.

3. Miscellaneous other symbols

Garay plus sign (10D8E) and minus sign (10D8F).
In the absence of any better information about these, just intercalate in the math symbols section of unidata.txt after 002B PLUS SIGN and 2212 MINUS SIGN, respectively. Regenerate allkeys.txt and verify that these two symbols weight as expected and are indicated as variables.

4. Garay reduplication mark

10D6F GARAY REDUPLICATION MARK has its properties misconstrued. It is explained in the proposal (L2/22-048) under section 4, Punctuation. However, most examples of iteration or reduplication marks are designated as gc=Lm, Extender=True in the UCD. This keeps the reduplication or iteration mark within the context of the word for segmentation purposes, which is usually the desired outcome. A few similar marks have ended up as gc=Po. But the Garay proposal specifies the mark as gc=So, which is clearly wrong. I am updating UnicodeData.txt for 16.0 to change this character from gc=So to gc=Lm, and then updating the underlying library for the sifter accordingly.

With the updated interpretation of properties, 10D6F can be intercalated in the extenders section of unidata.txt. I put it between AAF4 MEETEI MAYEK WORD REPETITION MARK and 16B42 PAHAWH HMONG SIGN VOS NRUA. Regenerate allkeys.txt and verify that 10D6F shows up with a primary weight and is grouped with the extenders.

16 more down, 4 to go.

Archive this delta 10: unidata-16.0.0d10.txt (1602884 bytes, 10/08/2023)
From Ken:

At this point there are only 4 more characters to deal with: the Garay and Tulu-Tigalari gemination marks and the nasalization and lengthener marks for Ol Onal. All four are gc=Mn and, as discussed above, should probably end up with secondary weights. Because the addition of script-specific secondary weights requires some corresponding code changes to the sifter code, I have saved these four for last. I will provide the intercalation points for unidata.txt, but then also spell out in detail what corresponding changes need to be made to the source code to enable the extension of the list of defined secondary weights. At that time I will also update the various dates and version information in the sifter source code, so that allkeys.txt and other output files get stamped with the correct versions and dates.

1. Ol Onal MU and IKIR (1E5EE..1E5EF)

Thinking about these overnight, there doesn't seem to be any strong case for these requiring *script-specific* secondary weights. They are the only combining marks used in Ol Onal. One is just a dot above and the other is just a dot below (and only used on one letter). Their collation implications can be handled simply by equating them to the generic above and generic below secondary weights.

Intercalate 1E5EE and 1E5EF into unidata.txt in the combining marks section after the 4 Nag Mundari combining marks, and weight them as generic above and generic below. Regenerate allkeys.txt and examine the diff against allkeys-16.0.0d10.txt. (This can be done with a diff, because the two marks are being given existing secondary weights.)

2. Garay and Tulu-Tigalari gemination marks (10D6A, 113D2)

These were discussed at some length above. For consistency with similar cases in other scripts, a solution that gives them script-specific secondary weights seems best. Both occur in scripts that have other combining marks that these gemination marks should be distinguished from.
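The verification that such marks came out primary-ignorable can also be spot-checked mechanically against allkeys.txt lines. A sketch, assuming the single-element allkeys line format `XXXX ; [.pppp.ssss.tttt] # NAME`; the sample weights below are hypothetical placeholders, not the released values (in a real check, read the lines for 1E5EE/1E5EF from the generated file):

```python
# Spot-check that a mark is primary-ignorable (primary 0000) with a
# secondary in the 0021..01FF range mentioned for DUCET secondaries.
import re

sample = [
    "1E5EE ; [.0000.0033.0002] # OL ONAL SIGN MU (hypothetical weights)",
    "1E5EF ; [.0000.0034.0002] # OL ONAL SIGN IKIR (hypothetical weights)",
]

LINE = re.compile(
    r"^([0-9A-F]{4,6}) ; \[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")

def check_secondary_mark(line):
    """Assert the line is primary-ignorable with an in-range secondary."""
    m = LINE.match(line)
    cp = m.group(1)
    p, s, t = (int(m.group(i), 16) for i in (2, 3, 4))
    assert p == 0, f"{cp}: expected a zero primary weight"
    assert 0x0021 <= s <= 0x01FF, f"{cp}: secondary outside 0021..01FF"
    return cp, s

results = [check_secondary_mark(line) for line in sample]
```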
Introducing new secondary weights in the sifter is much less automatic than just letting the sifter assign new primary weights. Because the sifter's generation of allkeys.txt is still tightly coupled to the generation of the symbolic forms used for the CTT table for ISO 14651, the code needs to be touched to introduce new secondary weight symbols in the correct order. (At some point soon, we might decide to give up on generating the CTT table for ISO 14651, but that is a separate decision that would require a significant overhaul of the code to clean up the sifter, and shouldn't be considered now simply to avoid the work for two new secondaries for 16.0.) So first I detour to do the 16.0 updating of the sifter.

unisift.c

Update the two version strings and the versioned file names to 16.0.0. The copyright year hasn't (yet) changed since this code was last touched for 15.1. For consistency in documentation, update the NOTDEF section of the switch statement in unisift_ForceToSecondary() for the three Tulu-Tigalari and one Gurung Khema mark that were equated to generic anusvara, visarga, etc. This doesn't impact the actual code, but is good practice for bookkeeping on these exceptions.

unisyms.c

Bump up the NUMSECONDSYMS constant from 261 to 263 for the two new weights to be added. There are no strong rules for the order in which new secondaries need to be intercalated, and no obvious positions, since Tulu-Tigalari and Garay are both new. To keep things simple, I chose to add both of them together, right after the existing entry for the Soyombo gemination mark in secondSyms[]:

    "<D11A98>", /* Soyombo gemination mark */
    "<D10D6A>", /* Garay gemination mark */
    "<D113D2>", /* Tulu-Tigalari gemination mark */

That requires a corresponding addition of two entries to secondSymVals[]:

    /* Soyombo, Garay, Tulu-Tigalari */
    0x11A98, 0x10D6A, 0x113D2,

And then the actual two entries for 10D6A and 113D2 need to be moved into unidata.txt in the same relative position, right after 11A98.
I also made two small tweaks to ranges in dumpCollatingSymbols() to ensure that those ranges pick up the new Gurung Khema and Symbols for Legacy Computing Supplement blocks when dumping collating symbols for the CTT for 14651. Corresponding tweaks needed to be made to the ranges in unisift_BuildSymWtTree().

This manual insertion of secondary marks is, of course, very tedious and error-prone in cases where several new marks need to be added for a version. It would be helpful if it could be automated a bit more, but the work involved has never risen to the level where it could offset the considerable effort that would be required to figure out and implement actual automation here. The whole issue of secondary weights in DUCET has seen an extensive amount of custom tinkering over the years, in large part to keep the inflation of secondary values under control, and to avoid breaking out of the magic number range 0021..01FF for secondaries, so the DUCET table could keep the lowest primary weight stable at 0200. We would have broken the bank years ago, if not for the considerable work in custom folding of lots of secondary weights for marks in different scripts into common, generic secondary weight values.

Rebuild the sifter and deploy. Regenerate allkeys.txt and verify that the two new secondary weights have been added correctly, and that the secondary weight range shows in the output diagnostics as having been bumped up to 263.

My first run for this in fact turned up an underlying problem in my library when I checked for the magic number 263 and came up one short in the output. I had mistakenly given an Alphabetic value to 10D6A, which resulted in the sifter giving it a primary weight, instead of the desired secondary weight next to the other two gemination marks.
Correction of the library resulted in the expected output in allkeys.txt:

    11A98 ; [.0000.00D2.0002] # SOYOMBO GEMINATION MARK
    10D6A ; [.0000.00D3.0002] # GARAY CONSONANT GEMINATION MARK
    113D2 ; [.0000.00D4.0002] # TULU-TIGALARI GEMINATION MARK

For these secondary weight changes, another necessary check is to verify that the CTT generation picked up the correct new symbols. For that I check the ctt14651.txt output file:

    collating-symbol <D11A98> % SOYOMBO GEMINATION MARK
    collating-symbol <D10D6A> % GARAY CONSONANT GEMINATION MARK
    collating-symbol <D113D2> % TULU-TIGALARI GEMINATION MARK
    ...
    <D11A98> % SOYOMBO GEMINATION MARK
    <D10D6A> % GARAY CONSONANT GEMINATION MARK
    <D113D2> % TULU-TIGALARI GEMINATION MARK
    ...
    <U11A98> IGNORE;<D11A98>;<MIN>;<SFFFF> % SOYOMBO GEMINATION MARK
    <U10D6A> IGNORE;<D10D6A>;<MIN>;<SFFFF> % GARAY CONSONANT GEMINATION MARK
    <U113D2> IGNORE;<D113D2>;<MIN>;<SFFFF> % TULU-TIGALARI GEMINATION MARK

And everything seems to be in order there. If the tables in unisyms.c are not correctly updated, these symbol assignments for the CTT can easily end up with off-by-one errors, throwing the table completely out of whack.

4 more down, 0 to go. Yay!

Archive this delta 11: unidata-16.0.0d11.txt (1603293 bytes, 10/09/2023)

Generate decomps-16.0.0d11.txt (sifter -t unidata-16.0.0d11.txt) to document the changes in decompositions for this version of UCA. I diff this file against the released version for 15.1.0, decomps-15.1.0d4.txt, to see what changed. It shows all the new decompositions, including all the synthetic decomposition additions for collation. It also shows the order change for the two secondary decompositions for Kannada and Sinhala, discussed above.

----

From Ken via email:

As usual, there are also some small modifications to the sifter source code, particularly to deal with the introduction of new secondary weights for UCA 16.0.
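The off-by-one hazard in the parallel unisyms.c tables lends itself to a mechanical cross-check. A sketch in Python (the list names mirror the C identifiers; only the three entries quoted above are shown): every `<Dxxxxx>` symbol in secondSyms[] should line up index-for-index with the code point in secondSymVals[].

```python
# Cross-check that the parallel secondary-symbol tables stay aligned:
# each "<Dxxxxx>" symbol must name the code point at the same index.
second_syms = ["<D11A98>", "<D10D6A>", "<D113D2>"]
second_sym_vals = [0x11A98, 0x10D6A, 0x113D2]

assert len(second_syms) == len(second_sym_vals), "tables differ in length"
for sym, cp in zip(second_syms, second_sym_vals):
    # {cp:04X} pads to at least 4 hex digits, matching the symbol spelling.
    assert sym == f"<D{cp:04X}>", f"misaligned entry: {sym} vs {cp:05X}"
```

A check like this, run over the full 263-entry tables, would catch exactly the kind of misalignment that throws the CTT symbol assignments out of whack.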
I was able to pare away various secondaries and get the additions down to just two new secondary weights -- but any addition of a secondary requires some work on the sifter code. And there are always version and a few range updates required for a new version, as well.

Note that handling the full new set (1177 characters) for Unicode 16.0 in the sifter requires that some character properties outside of UnicodeData.txt also be updated correctly. The most important of these is Alphabetic, which depends on Other_Alphabetic in PropList.txt. To a lesser extent, the values for Diacritic and Extender (also in PropList.txt) might impact a few weight assignments.

I very strongly recommend that before going much further on script-by-script UCA work, particularly for the abugidas, you first take a break to do a complete update of PropList.txt for 16.0. Neglecting that step will just lead to confusion and characters out of place later on in the UCA work, particularly when you get to Tulu-Tigalari.
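The dependence on PropList.txt comes from Alphabetic being a derived property: per UAX #44 it is roughly Uppercase + Lowercase + gc in {Lt, Lm, Lo, Nl} + Other_Alphabetic. A simplified sketch (the function and its arguments are illustrative, not the sifter's code) of why an abugida's gc=Mn/Mc vowel signs only count as Alphabetic once PropList.txt is updated:

```python
# Simplified derivation of Alphabetic, roughly following UAX #44.
def is_alphabetic(gc, other_alphabetic, uppercase=False, lowercase=False):
    return (uppercase or lowercase
            or gc in {"Lt", "Lm", "Lo", "Nl"}
            or other_alphabetic)

# A Brahmic vowel sign is gc=Mc or gc=Mn: without an Other_Alphabetic entry
# in an updated PropList.txt, it does not come out Alphabetic at all.
assert is_alphabetic("Mc", other_alphabetic=True)
assert not is_alphabetic("Mn", other_alphabetic=False)
assert is_alphabetic("Lo", other_alphabetic=False)   # plain letters need no help
```

This is also the failure mode Ken hit in the other direction: a spurious Alphabetic value on 10D6A caused the sifter to hand it a primary weight instead of the intended secondary.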
Spot-checking commits. Looks good so far.
(Closing and reopening to rerun CI.)
Ken has sent more updates to the data and to the sifter code. I will try to catch up with those “soon”.
From Ken:

I've done the code refactoring that enables the processing of more than one secondary decomposition in the input file, so we could get the full set of contractions working for the Kirat Rai vowel au.

There was a very substantial refactoring of the code related to processContractions(). Instead of the prior one-off branch that checked for a secondary decomposition and handled it specially, the code now assumes that a decomposition in the input file is a comma-separated list, which usually defaults to a single entry, if present. However, rather than refactoring the code so that it could handle indefinite lists of decompositions for contractions, I just have it work now with a static array of up to four decompositions. For over a decade, we've been able to get by with two. Kirat Rai forces us up to three. I expect it will take a while to find a situation that requires four -- but if we do eventually need to go there, the code will continue to just work with no further updates. Doing it this way avoided even *more* extensive changes to the code to build up and tear down dynamic lists for this extremely edgy edge case.
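The comma-separated-list-with-a-static-cap approach can be sketched briefly. This is an assumption-laden Python rendering, not the sifter's C code: the function name, the field layout, and the placeholder decomposition sequences are all hypothetical.

```python
# Treat the decomposition field as a comma-separated list, capped at a
# static maximum of four entries (mirroring the static array described
# above: one is the common case, Kirat Rai needs three).
MAX_DECOMPS = 4

def parse_decomps(field):
    """Split a decomposition field into at most MAX_DECOMPS entries."""
    if not field:
        return []
    decomps = [d.strip() for d in field.split(",")]
    if len(decomps) > MAX_DECOMPS:
        raise ValueError(f"more than {MAX_DECOMPS} decompositions: {field!r}")
    return decomps

assert parse_decomps("0901") == ["0901"]   # the usual single-entry case
# Three hypothetical contraction decompositions, as for the Kirat Rai au:
assert len(parse_decomps("AAAA BBBB,AAAA CCCC,BBBB CCCC")) == 3
```

Keeping the cap static means a fifth decomposition would fail loudly at parse time instead of silently truncating, which matches the "the code will continue to just work" expectation up to four.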
From Ken: I've updated my library and the sifter input file, and now have a new draft of allkeys.txt including the 15 characters added at UTC-177. The intercalations were all easy, and an examination of the resulting weights looks fine to me.
I have caught up with Ken's recent changes, and I am taking this out of draft for review and merging. I will look into the UCA test failure separately; at a minimum, I will need to hardcode sample characters for future scripts.
From Ken: I removed the reference to ustbuild, which is a separate utility that I used to house in my sifter directory, but which [...] I moved out to a separate directory with a separate make.
Changes look like they have caught up o.k.
Successive changes from Ken for adding new characters to the default sort order.
Files from Ken taken verbatim; commit messages quoting Ken's progress notes (notes per delta from Ken's UCA16Journal.txt in each corresponding commit).